Large scale machine learning
__TOC__ Large scale machine learning A general introduction to large scale machine learning (LSML) can be found at: *Minning data sets , chapter 12: Large Scale Machine Learning *A course by John Langford and Yann LeCun at NYU : Big Data, Large Scale Machine Learning, the lectures are available here *Large Scale Machine , by Ronan Collobert : A large scale ML thesis with special focus on SVMs *Map-Reduce for Machine Learning. One way to cope with LSML is to use optimization theory for large scale systems, we'll explore this in the "Large scale optimizaiton" section. Another way to cope with LSML is to use decission trees. Large scale optimization The theory that makes it possible for machine learning to work at big scales is studied in detail in the following references: #In 28 Bertsekas, et. al. describe the state-of-the-art of Parallel and Distributed Numerical Methods (at least at that time, 1989) #Optimization Algorithms in Machine Learning , by Stephen Wright #Vandenberghe course on Optimization Methods for Large-Scale Systems #Scaling Up Machine Learning : A video from the Ron Bekkerman, the editor of the book "Scaling up Machine Learning " #Landson's book Optimization Theory for Large Systems and course Large-Scale Systems Optimization #The latest bet from Boyd : Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers #On Recent Trends in Large-Scale Convex Optimization #A good recopilation of resources regarding large scale optimization. When talking about large scale optimization it is useful to differentiate between the following categories of algorithms: *Fast but not scalable : This are algorithms that draw fast convergence theorems from optimization theory to make algorithms that (assuming the availablity of all the data in a RAM) find a maximum/minimum very fast. *Scalable but not distributed : This are algorithms that use special optimiziation theory ( e.g. stochastic approximation theory ) to process the data in chunks that fit the RAM of a computer. *Distributed : This are algorithms that divide the optimization/learning problem into small data-independent sub-problems that can be solved on different computers. While distributed algorithms should, in general, be fast (thanks to the fact that many nodes can be used); fast algorithms are not necessarily distributed. Online algorithms (scalable) Online algorithms process only the data available (in RAM or in time). Depending on their nature they may adapt to the new trends of the data or hold to an initial trend. Yahoo! and Bottou have made interesting applications of online algorithms. 'Stochastic gradient descent' The theory behind SGD is given by the stochastic approximation 20 area of research. Some of the interesting research topics here are: *Adaptative step sizes 21,23 *Initial guesses *Distributing/Averaging SGD 16,17,27 : This has been used in the representation learning literature section 3 *Multivariate SGD 19 The implementation of SGD in BIGS is described, along with a review of some literature relevat for optimization using SGD, in this blog post . 'Passive-aggresive algorithms' This algorithms draw from the theory of 14 Distributed algorithms Coordinate gradient descent Section 2.7 Alternating Direction Method of Multipliers 18 Incremental proximal methods ?? 26 In distributed 29 "... hybrid online+batch algorithm with non-uniform parameter averaging" is proposed. In 30 a bundle method is proposed in which parallelization is achived by distributing the calculation of the empirical risk. "Instead of minimizing g(w) directly, cutting plane methods minimize it approximately by iteratively solving a linear program arising from its lower bound. Bundle methods are cutting plane methods stabilized with the Moreau-Yosida regularizer" 30. Software *Wopal wabbit *GraphLab * Large scale matrix factorization There are dozens of matrix factorization algorithms finding their way on many applications (as outlined in The Matrix Factorization Jungle ), specially on machine learning 12. This algorithms usually rely on the large scale optimization theory mentioned above, and so we'll divide the algorithms for this application in the above mentioned categories. Fast algorithms The most famous NMF algorithm is the Lee-Seung 8 multiplicative (or additive) update rules. A review of fast algorithms to solve NMF can be found at 2;9. The fastest algorithm to date is the HALS algorithm 10 Online algorithms An interesting online approach to matrix factorization is 15 Stochastic Gradient Descent 6,7,11,13 Distributed algorithms Using coordinate descent 3,4 *Projected gradient 5 References 1 D. P. Bertsekas, Nonlinear Programming. Belmont, MA, 02178-9998: Athena Scientiﬁc, second ed., 1999. 2 Berry, Michael W., et al. "Algorithms and applications for approximate nonnegative matrix factorization ." Computational Statistics & Data Analysis 52.1 (2007): 155-173. 3Hsieh, Cho-Jui, and Inderjit S. Dhillon. "Fast coordinate descent methods with variable selection for non-negative matrix factorization ." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011. 4 Yu, Hsiang-Fu, et al. "Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. " Data Mining (ICDM), 2012 IEEE 12th International Conference on. IEEE, 2012. 5 Lin, Chih-Jen. "Projected gradient methods for nonnegative matrix factorization. " Neural computation 19.10 (2007): 2756-2779. 6 Gemulla, Rainer, et al. "Large-scale matrix factorization with distributed stochastic gradient descent. " Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011. 7 Recht, Benjamin, and Christopher Ré. "Parallel stochastic gradient algorithms for large-scale matrix completion ." Mathematical Programming Computation (2011): 1-26. 8 Liu, Chao, et al. "Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. " Proceedings of the 19th international conference on World wide web. ACM, 2010. 9 Kim, Jingu, and Haesun Park. "Toward faster nonnegative matrix factorization: A new algorithm and comparisons ." Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 2008. 10 Cichocki, Andrzej, and P. H. A. N. Anh-Huy. "Fast local algorithms for large scale nonnegative matrix and tensor factorizations ." IEICE transactions on fundamentals of electronics, communications and computer sciences 92.3 (2009): 708-721. 11 Westfall, Patrick J., Daniel R. Ballon, and Jeremy Thorner. "When the stress of your environment makes you go HOG wild. " Science 306.5701 (2004): 1511-1512. 12Srebro, Nathan. Learning with matrix factorizations . Diss. Massachusetts Institute of Technology, 2004. 13 14 Crammer, Koby, et al. "Online passive-aggressive algorithms ." The Journal of Machine Learning Research 7 (2006): 551-585. 15 Mairal, Julien, et al. "Online learning for matrix factorization and sparse coding ." The Journal of Machine Learning Research 11 (2010): 19-60. 16 Polyak, Boris T., and Anatoli B. Juditsky. "Acceleration of stochastic approximation by averaging ." SIAM Journal on Control and Optimization 30.4 (1992): 838-855. 17 Le Breton, Alain, and Aleksandr Aleksandrovich Novikov. "On stochastic approximation procedures with averaging ." Theory of Probability & Its Applications 44.3 (2000): 591-604. 18 Boyd, Stephen, et al. "Distributed optimization and statistical learning via the alternating direction method of multipliers ." Foundations and Trends® in Machine Learning 3.1 (2011): 1-122. 19 Spall, James C. "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation ." Automatic Control, IEEE Transactions on 37.3 (1992): 332-341. 20 Kushner, Harold J., and George Yin. Stochastic approximation and recursive algorithms and applications . Vol. 35. Springer, 2003. 21 Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization ." The Journal of Machine Learning Research 999999 (2011): 2121-2159. 22 Bengio, Yoshua. "Deep learning of representations: Looking forward ." arXiv preprint arXiv:1305.0445 (2013). 23 Zeiler, Matthew D. "ADADELTA: An Adaptive Learning Rate Method ." arXiv preprint arXiv:1212.5701 (2012). 24 Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization ." The Journal of Machine Learning Research 999999 (2011): 2121-2159. 25 Spall, James C. Introduction to stochastic search and optimization: estimation, simulation, and control . Vol. 65. Wiley-Interscience, 2005. 26 Bertsekas, Dimitri P. "Incremental proximal methods for large scale convex optimization ." Mathematical programming 129.2 (2011): 163-195. 27 Xiao, Lin. "Dual averaging methods for regularized stochastic learning and online optimization ." The Journal of Machine Learning Research 9999 (2010): 2543-2596. 28 Bertsekas, Dimitri P., and John N. Tsitsiklis. "Parallel and distributed computation: numerical methods . 1989." Englewood Clifffs, New Jersey: Printice-Hall. 29 Agarwal, Alekh, et al. "A reliable effective terascale linear learning system ." arXiv preprint arXiv:1110.4198 (2011). 30 Teo, Choon Hui, et al. "A scalable modular convex solver for regularized risk minimization ." Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007.